Abstract

Lexical co-occurrence is an important cue for 
detecting word associations. We present a 
theoretical framework for discovering statis- 
tically significant lexical co-occurrences from 
a given corpus. In contrast with the preva- 
lent practice of giving weightage to unigram 
frequencies, we focus only on the documents 
containing both the terms (of a candidate bi- 
gram). We detect biases in span distributions 
of associated words, while being agnostic to 
variations in global unigram frequencies. Our 
framework has the fidelity to distinguish dif- 
ferent classes of lexical co-occurrences, based 
on strengths of the document and corpus- 
level cues of co-occurrence in the data. We 
perform extensive experiments on benchmark 
data sets to study the performance of vari- 
ous co-occurrence measures that are currently 
known in literature. We find that a relatively 
obscure measure called Ochiai, and a newly 
introduced measure CSA capture the notion of 
lexical co-occurrence best, followed next by 
LLR, Dice, and TTest, while another popular 
measure, PMI, surprisingly, performs poorly in
the context of lexical co-occurrence. 



1 Introduction 

The notion of word association is important for 
numerous NLP applications, like, word sense dis- 
ambiguation, optical character recognition, speech 
recognition, parsing, lexicography, natural language 
generation, and machine translation. Lexical co-occurrence
is an important indicator of word association, and this has
motivated several frequency-based measures for word association
(Church and Hanks, 1989; Dunning, 1993; Dice, 1945; Washtell
and Markert, 2009). In this paper, we present a theoretical
basis for detection and classification of lexical
co-occurrence.¹ In general, a lexical co-occurrence could
refer to a pair of words that occur
in a large number of documents; or it could refer 
to a pair of words that, although they appear in only a
small number of documents, frequently occur very
close to each other within each document. We for- 
malize these ideas and construct a significance test 
for co-occurrences that will allow us to detect dif- 
ferent kinds of co-occurrences within a single uni- 
fied framework (a feature which is absent in current 
measures for co-occurrence). As a by-product, our 
framework also leads to a better understanding of 
existing measures for word co-occurrence. 



As pointed out in (Kilgarriff, 2005), language is 
never random - which brings us to the question of 
what model of random chance can give us a good 
statistical test for lexical co-occurrences. We need a 
null hypothesis that can account for an observed co-occurrence
as a pure chance event, and this in turn requires a corpus
generation model. It is often reasonable to assume that documents
in the corpus are generated independently of each other. Existing
frequency-based association measures like PMI (Church and Hanks,
1989), LLR (Dunning, 1993), etc. further assume that each document
is drawn from a multinomial distribution based on global unigram
frequencies. The main concern with such a null model is the
overbearing influence of unigram frequencies on the detection of
word associations. For example, the association between
anomochilidae (dwarf pipe snakes) and snake would go undetected in
our Wikipedia corpus, since fewer than 0.1% of the pages containing
snake also contained anomochilidae. Similarly, under current models,
the expected span (inter-word distance) of a bigram is also very
sensitive to the associated unigram frequencies: the expected span
of a bigram composed of low-frequency unigrams is much larger than
that of a bigram composed of high-frequency unigrams. This is
contrary to how word associations appear in language, where semantic
relationships manifest with small inter-word distances irrespective
of the underlying unigram distributions.

1 Note that we are interested in co-occurrence, not collocation,
i.e., pairs of words that co-occur in a document with an arbitrary
number of intervening words. Also, we use the term bigram to mean
bigram at-a-distance or spanned bigram: again, other words can
occur in between the constituents of a bigram.

These considerations motivate our search for a 
more direct relationship between words, one that can 
potentially be detected using careful statistical char- 
acterization of inter-word distances, while minimiz- 
ing the influence of the associated unigram frequen- 
cies. We focus on only the documents containing 
both the terms (of a candidate bigram) since in NLP 
applications, we often have to choose from a set of
alternatives for a given word. Hence, rather than ask
the abstract question of whether words x and y are
related, our approach is to ask, given that y is a candidate
for pairing with x, how likely is it that x and
y are lexically correlated. For example, the probability
that anomochilidae is found in the vicinity of snake
is higher if we know that anomochilidae and snake
appear in the same context.

We consider a null model that represents each
document as a bag of words.² Then, a random permutation
of the associated bag of words gives a linear
representation for the document. An arbitrary relation
between a pair of words will result in the locations
of these words being randomly distributed in the
documents in which they co-occur. If the observed
span distribution of a bigram resembles that under 
the (random permutation) null model, then the rela- 
tion between the words is not strong enough for one 
word to influence the placement of the other. However,
if the words are found to occur closer together
than explainable by our null model, then we hypothesize
the existence of a more direct association between
these words.

2 There can be many ways to associate a bag of words with a 
document. Details of this association are not important for us, 
except that the bag of words provides some kind of quantitative 
summary of the words within the document. 



In this paper, we formalize the notion of statis- 
tically significant lexical co-occurrences by intro- 
ducing a null model that can detect biases in span 
distributions of word associations, while being ag- 
nostic to variations in global unigram frequencies. 
Our framework has the fidelity to distinguish dif- 
ferent classes of lexical co-occurrences, based on 
strengths of the document and corpus-level cues of 
co-occurrence in the data. We perform extensive ex- 
periments on benchmark data sets to study the per- 
formance of various co-occurrence measures that are 
currently known in literature. We find that a rela- 
tively obscure measure called Ochiai, and a newly 
introduced measure CSA, capture the notion of lexi- 
cal co-occurrence best, followed next by LLR, Dice, 
and TTest, while another popular measure, PMI, 
surprisingly, performs poorly in the context of
lexical co-occurrence.

2 Lexically significant co-occurrences 

Consider a bigram $\alpha$. Let $\mathcal{D} = \{D_1, \ldots, D_K\}$
denote the set of $K$ documents (out of the entire corpus) that
contain at least one occurrence of $\alpha$. The frequency of
$\alpha$ in document $D_i$, $f_i$, is the maximum number of
non-overlapped occurrences of $\alpha$ in $D_i$. A set of occurrences
of a bigram is called non-overlapping if the words corresponding to
one occurrence from the set do not appear in between the words
corresponding to any other occurrence from the set.

The span of an occurrence of $\alpha$ is the 'unsigned distance'
between the first and last textual units of interest associated with
that occurrence. We mostly use words as the unit of distance, but in
general, distance can be measured in words, sentences, or even
paragraphs (e.g., an occurrence comprising two adjacent words in a
sentence has a word-span of one and a sentence-span of zero).
Likewise, the size of a document $D_i$, denoted as $\ell_i$, is
correspondingly measured in units of words, sentences, or paragraphs.
Finally, let $\hat{f}_i$ denote the maximum number of non-overlapped
occurrences of $\alpha$ in $D_i$ with span less than a given
threshold $x$. We refer to $\hat{f}_i$ as the span-constrained
frequency of $\alpha$ in $D_i$. Note that $\hat{f}_i$ cannot exceed
$f_i$.
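
As an illustration of these two counts, here is a minimal Python sketch. It assumes a whitespace-tokenized document, treats the bigram as unordered, measures span as the difference in word positions, and uses an earliest-ending greedy scan to maximize the number of non-overlapped occurrences. The function name `count_occurrences` and these choices are assumptions made for illustration, not details given in the text.

```python
def count_occurrences(tokens, w1, w2, max_span=None):
    """Count non-overlapped occurrences of the unordered bigram (w1, w2).

    With max_span=None the result plays the role of the frequency f;
    with max_span=x it plays the role of the span-constrained frequency
    (only occurrences whose span is strictly less than x are counted).
    """
    last = {w1: None, w2: None}  # latest unused position of each word
    count = 0
    for pos, tok in enumerate(tokens):
        if tok not in last:
            continue
        other = w2 if tok == w1 else w1
        start = last[other]
        if start is not None and (max_span is None or pos - start < max_span):
            count += 1                   # close the earliest-ending occurrence
            last = {w1: None, w2: None}  # later occurrences must start after pos
        else:
            last[tok] = pos              # remember the most recent candidate start
    return count


tokens = "snake x x pipe snake pipe".split()
f = count_occurrences(tokens, "snake", "pipe")                  # 2
f_hat = count_occurrences(tokens, "snake", "pipe", max_span=2)  # 1
```

In the toy document the two span-1 candidates share the second "snake", so only one of them can be counted once the span threshold is 2.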

To assess the statistical significance of the bigram $\alpha$, we ask
if the span-constrained frequency $\hat{f}_i$ (of $\alpha$) is more
than what we would expect for it in a document of size $\ell_i$
containing $f_i$ 'random' occurrences of $\alpha$. Our intuition is
that if two words are semantically related, they will often appear
close to each other in the document, and so the distribution of the
spans will typically exhibit a prominent bias toward values less than
a small $x$.

Consider the null hypothesis that a document is generated as a random
permutation of the bag of words associated with the document. Let
$\pi_x(\hat{f}, f, \ell)$ denote the probability of observing a
span-constrained frequency (for $\alpha$) of at least $\hat{f}$ in a
document of length $\ell$ that contains a maximum of $f$
non-overlapped occurrences of $\alpha$. Observe that
$\pi_x(0, f, \ell) = 1$ for any $x > 0$; also, for $x > \ell$ we have
$\pi_x(f, f, \ell) = 1$ (i.e., all $f$ occurrences will always have
span less than $x$ for $x > \ell$). However, for typical values of
$x$ (i.e., for $x \ll \ell$) the probability $\pi_x(\hat{f}, f, \ell)$
decreases with increasing $\hat{f}$. For example, consider a document
of length 400 with 4 non-overlapped occurrences of $\alpha$. The
probabilities of observing at least 4, 3, 2, 1, and 0 occurrences of
$\alpha$ within a span of 20 words are 0.007, 0.09, 0.41, 0.83, and
1.0 respectively. Since $\pi_{20}(3, 4, 400) = 0.09$, even if 3 of the
4 occurrences of $\alpha$ (in the example document) have span less
than 20 words, there is a 9% chance that the occurrences were a
consequence of a random event (under our null model). As a result, if
we desired a confidence level of at least 95%, we would have to
declare $\alpha$ as insignificant.

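Given an $\epsilon$ ($0 < \epsilon < 1$) and a span upper bound $x$
($> 0$), the document $D_i$ is said to support the hypothesis
"$\alpha$ is an $\epsilon$-significant bigram" if
$\pi_x(\hat{f}_i, f_i, \ell_i) < \epsilon$. We refer to $\epsilon$ as
the document-level lexical co-occurrence of $\alpha$. Define indicator
variables $z_i$, $i = 1, \ldots, K$, as:

$$z_i = \begin{cases} 1 & \text{if } \pi_x(\hat{f}_i, f_i, \ell_i) < \epsilon \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

Let $Z = \sum_{i=1}^{K} z_i$; $Z$ models the number of documents (out
of $K$) that support the hypothesis "$\alpha$ is an
$\epsilon$-significant bigram." The expected value of $Z$ is given by

$$E(Z) = \sum_{i=1}^{K} E(z_i) \qquad (2)$$

$$\phantom{E(Z)} = \sum_{i=1}^{K} \pi_x(g_\epsilon(f_i, \ell_i), f_i, \ell_i) \qquad (3)$$

where $g_\epsilon(f_i, \ell_i)$ denotes the smallest $\hat{f}$ for
which we can get $\pi_x(\hat{f}, f_i, \ell_i) < \epsilon$. (This
quantity is well-defined since $\pi_x(\hat{f}, f_i, \ell_i)$ is
non-increasing with respect to $\hat{f}$.) For the example given
earlier, $g_{0.2}(4, 400) = 3$, and $g_\epsilon(4, 400) = 4$ for any
$0.007 < \epsilon \leq 0.09$.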
Using Hoeffding's Inequality, for $t > 0$,

$$P[Z \geq E(Z) + Kt] \leq \exp(-2Kt^2) \qquad (4)$$

Therefore, we can bound the deviation of the observed value of $Z$
from its expectation by choosing $t$ appropriately. For example, in
our corpus, the bigram (canyon, landscape) occurs in $K = 416$
documents. For $\epsilon = 0.1$, we find that $Z = 33$ documents (out
of 416) have $\epsilon$-significant occurrences, while $E(Z)$ is
14.34. Let $\delta = 0.01$. By setting
$t = \sqrt{\ln \delta / (-2K)} = 0.07$, we get $E(Z) + Kt = 43.46$,
which is greater than the observed value of $Z$ (= 33). Thus, we
cannot be 99% sure that the occurrences of (canyon, landscape) in the
33 documents were a consequence of non-random phenomena. Hence, our
test declares (canyon, landscape) as insignificant at
$\epsilon = 0.1$, $\delta = 0.01$. We formally state the significance
test for lexical co-occurrences next:

Definition 1 (Significant lexical co-occurrence)

Consider a bigram $\alpha$ and a set of $K$ documents containing at
least one occurrence of $\alpha$. Let $Z$ denote the number of
documents (out of $K$) that support the hypothesis "$\alpha$ is an
$\epsilon$-significant bigram" (for a given $\epsilon > 0$, $x > 0$).
The $K$ occurrences of the bigram $\alpha$ are regarded as
$\epsilon$-significant with confidence $(1 - \delta)$ (for some
user-defined $\delta > 0$) if we have $[Z \geq E(Z) + Kt]$, where
$t = \sqrt{\ln \delta / (-2K)}$ and $E(Z)$ is given by Eq. (3). The
ratio $[Z / (E(Z) + Kt)]$ is called the Co-occurrence Significance
Ratio (CSR) for $\alpha$.
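
The test of Definition 1 reduces to a few lines of arithmetic. The following Python sketch is illustrative only: the function name `csr` and its argument names are assumptions, and the inputs are the (canyon, landscape) figures quoted earlier. Note that with $t$ left unrounded (about 0.074) the threshold $E(Z) + Kt$ comes out near 45.3 rather than the 43.46 obtained with $t$ rounded to 0.07; the bigram is declared insignificant either way.

```python
import math


def csr(z, expected_z, k, delta):
    """Return (CSR, t) for z supporting documents out of k.

    t is the Hoeffding deviation sqrt(ln(delta) / (-2k)); the bigram is
    declared significant at confidence (1 - delta) when CSR >= 1.
    """
    t = math.sqrt(math.log(delta) / (-2 * k))
    return z / (expected_z + k * t), t


# (canyon, landscape): K = 416, Z = 33, E(Z) = 14.34, delta = 0.01
ratio, t = csr(z=33, expected_z=14.34, k=416, delta=0.01)
significant = ratio >= 1
```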

We now describe how to compute $\pi_x(\hat{f}_i, f_i, \ell_i)$ for
$\alpha$ in $D_i$. Let $N(f_i, \ell_i)$ denote the number of ways of
embedding $f_i$ non-overlapped occurrences of $\alpha$ in a document
of length $\ell_i$. Similarly, let $N_x(\hat{f}_i, f_i, \ell_i)$
denote the number of ways of embedding $f_i$ non-overlapped
occurrences of $\alpha$ in a document of length $\ell_i$ in such a way
that at least $\hat{f}_i$ of the $f_i$ occurrences have span less than
$x$. Recall that $\pi_x(\hat{f}_i, f_i, \ell_i)$ denotes the
probability of observing a span-constrained frequency (for $\alpha$)
of at least $\hat{f}_i$ in a document of length $\ell_i$ that contains
a maximum of $f_i$ non-overlapped occurrences of $\alpha$. Thus, we
can assign the probability $\pi_x(\hat{f}_i, f_i, \ell_i)$ in terms of
$N(f_i, \ell_i)$ and $N_x(\hat{f}_i, f_i, \ell_i)$ as follows:

$$\pi_x(\hat{f}_i, f_i, \ell_i) = \frac{N_x(\hat{f}_i, f_i, \ell_i)}{N(f_i, \ell_i)} \qquad (5)$$

To compute $N(f_i, \ell_i)$ and $N_x(\hat{f}_i, f_i, \ell_i)$, we
essentially need the histogram for $\hat{f}$ given $f$ and $\ell$. Let
$\mathrm{hist}_{f,\ell}[\hat{f}]$ denote the number of ways to embed
$f$ non-overlapped occurrences of a bigram in a document of length
$\ell$ in such a way that exactly $\hat{f}$ of the $f$ occurrences
satisfy the span constraint $x$. We can obtain $N(f_i, \ell_i)$ and
$N_x(\hat{f}_i, f_i, \ell_i)$ from $\mathrm{hist}_{f_i,\ell_i}$ using

$$N_x(\hat{f}_i, f_i, \ell_i) = \sum_{k=\hat{f}_i}^{f_i} \mathrm{hist}_{f_i,\ell_i}[k] \qquad (6)$$

$$N(f_i, \ell_i) = \sum_{k=0}^{f_i} \mathrm{hist}_{f_i,\ell_i}[k] \qquad (7)$$



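Algorithm 1 ComputeHist($f$, $\ell$)
Input: $\ell$ - length of document; $f$ - number of non-overlapped
occurrences to be embedded; $x$ - span constraint for occurrences
Output: $\mathrm{hist}_{f,\ell}[\cdot]$ - histogram of $\hat{f}$ when
$f$ occurrences are embedded in a document of length $\ell$
 1: Initialize $\mathrm{hist}_{f,\ell}[\hat{f}] \leftarrow 0$ for $\hat{f} = 0, \ldots, f$
 2: if $f > \ell$ then
 3:     return $\mathrm{hist}_{f,\ell}$
 4: if $f = 0$ then
 5:     $\mathrm{hist}_{f,\ell}[0] \leftarrow 1$
 6:     return $\mathrm{hist}_{f,\ell}$
 7: for $i \leftarrow 1$ to $(\ell - 1)$ do
 8:     for $j \leftarrow (i + 1)$ to $\ell$ do
 9:         $\mathrm{hist}_{f-1,\ell-j} \leftarrow$ ComputeHist($f - 1$, $\ell - j$)
10:         for $k \leftarrow 0$ to $f - 1$ do
11:             if $(j - i) < x$ then
12:                 $\mathrm{hist}_{f,\ell}[k + 1] \leftarrow \mathrm{hist}_{f,\ell}[k + 1] + \mathrm{hist}_{f-1,\ell-j}[k]$
13:             else
14:                 $\mathrm{hist}_{f,\ell}[k] \leftarrow \mathrm{hist}_{f,\ell}[k] + \mathrm{hist}_{f-1,\ell-j}[k]$
15: return $\mathrm{hist}_{f,\ell}$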



Algorithm 1 lists the pseudocode for computing the histogram
$\mathrm{hist}_{f,\ell}$. It enumerates all possible ways of embedding
$f$ non-overlapped occurrences of a bigram in a document of length
$\ell$. The main steps in the algorithm involve selecting a start and
end position for embedding the very first occurrence (lines 7-8) and
then recursively calling ComputeHist($\cdot$, $\cdot$) (line 9). The
$i$-loop selects a start position for the first occurrence of the
bigram, and the $j$-loop selects the end position. The task in the
recursion step is to now compute the number of ways to embed the
remaining $(f - 1)$ non-overlapped occurrences in the remaining
$(\ell - j)$ positions. Once we have $\mathrm{hist}_{f-1,\ell-j}$, we
need to check whether the occurrence introduced at positions $(i, j)$
will contribute to the $\hat{f}$ count. If $(j - i) < x$, whenever
there are $k$ span-constrained occurrences in positions $(j + 1)$ to
$\ell$, there will be $(k + 1)$ span-constrained occurrences in
positions 1 to $\ell$. Thus, we increment
$\mathrm{hist}_{f,\ell}[k + 1]$ by the quantity
$\mathrm{hist}_{f-1,\ell-j}[k]$ (lines 10-12). However, if
$(j - i) \geq x$, there is no contribution to the span-constrained
frequency from the $(i, j)$ occurrence, and so we increment
$\mathrm{hist}_{f,\ell}[k]$ by the quantity
$\mathrm{hist}_{f-1,\ell-j}[k]$ (lines 10-11, 13-14).

This algorithm is exponential in $f$ and $\ell$, but it does not
depend explicitly on the data. This allows us to populate the
histogram off-line, and publish the $\pi_x(\hat{f}, f, \ell)$ tables
for various $x$, $\hat{f}$, $f$ and $\ell$. (If the paper is accepted,
we will make an interface to this table publicly available.)
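
A runnable sketch of the histogram computation may help make the recursion concrete. The Python below is not the authors' implementation: it is a memoized variant that, for each end position $j$, counts how many start positions $i$ give a span below $x$ instead of looping over $i$ explicitly, which yields the same counts as the double loop in polynomial time. The names `compute_hist`, `tail_probabilities`, and `g_eps` are hypothetical. Under these assumptions the values for $\ell = 400$, $f = 4$, $x = 20$ should land close to the probabilities quoted earlier; small differences in the second decimal place are possible depending on how span boundaries are counted.

```python
from functools import lru_cache


def compute_hist(f, length, x):
    """hist[k] = number of ways to embed f non-overlapped occurrences in a
    document of `length` positions so that exactly k have span (j - i) < x."""

    @lru_cache(maxsize=None)
    def rec(n_occ, n_pos):
        hist = [0] * (n_occ + 1)
        if n_occ > n_pos:
            return tuple(hist)
        if n_occ == 0:
            hist[0] = 1
            return tuple(hist)
        for j in range(2, n_pos + 1):        # end position of the first occurrence
            near = max(0, min(j - 1, x - 1))  # start positions i with j - i < x
            far = (j - 1) - near              # start positions i with j - i >= x
            sub = rec(n_occ - 1, n_pos - j)   # embeddings of the remaining occurrences
            for k in range(n_occ):
                hist[k + 1] += near * sub[k]
                hist[k] += far * sub[k]
        return tuple(hist)

    return list(rec(f, length))


def tail_probabilities(f, length, x):
    """Return [pi_x(0, f, l), ..., pi_x(f, f, l)] as ratios of tail sums to the total."""
    hist = compute_hist(f, length, x)
    total = sum(hist)
    if total == 0:
        raise ValueError("document too short to embed f occurrences")
    return [sum(hist[k:]) / total for k in range(f + 1)]


def g_eps(eps, f, length, x):
    """Smallest f_hat with pi_x(f_hat, f, length) < eps, or None if there is none."""
    for f_hat, p in enumerate(tail_probabilities(f, length, x)):
        if p < eps:
            return f_hat
    return None
```

For instance, `tail_probabilities(4, 400, 20)` returns the five probabilities for observing at least 0, 1, 2, 3, and 4 span-constrained occurrences, and `g_eps(0.2, 4, 400, 20)` picks out the first index whose probability falls below 0.2.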

3 Utility of CSR test

Evidence for significant lexical co-occurrences can 
be gathered at two levels in the data - document- 
level and corpus-level. First, at the document level, 
we may find that a surprisingly high proportion of 
occurrences within a document (of a pair of words) 
have smaller spans than they would by random 
chance. Second, at the corpus-level, we may find 
a pair of words appearing closer-than-random in an 
unusually high number of documents in the corpus. 
The significance test of Definition 1 is capable of gathering both
kinds of evidence from data in carefully calibrated amounts.
Prescribing $\epsilon$ essentially fixes the strength of the
document-level hypothesis in our test. A small $\epsilon$ corresponds
to a strong document-level hypothesis and vice versa. The second
parameter in our test, $\delta$, controls the confidence of our
decision given all the documents in the data corpus. A small $\delta$
represents a high-confidence test (in the sense that there are a
surprisingly large number of documents in the corpus, each of which
individually has some evidence of relatedness for the pair of words).
By running the significance test with different values of $\epsilon$
and $\delta$, we can detect different types of lexically significant
co-occurrences. We illustrate the utility of our test of significance
by considering the 4 types of lexically significant co-occurrences:

Type A: These correspond to the strongest lexical co-occurrences in
the data, with strong document-level hypotheses (low $\epsilon$) as
well as high corpus-level confidence (low $\delta$). Intuitively, if a
pair of words appear close together several times within a document,
and if this pattern is observed in a large number of documents, then
the co-occurrence is of Type A.

Type B: These are co-occurrences based on weak document-level
hypotheses (high $\epsilon$) but because of repeated observation in a
substantial number of documents in the corpus, we can still detect
them with high confidence (low $\delta$). We expect many interesting
lexical co-occurrences in text corpora to be of Type B: pairs of words
that appear close to each other only a small number of times within a
document, but that appear together in a large number of documents.

Type C: Sometimes we may be interested in words that are strongly
correlated within a document, even if we observe the strong
correlation only in a relatively small number of documents in the
corpus. These correspond to Type C co-occurrences. Although they are
statistically weaker inferences than those of Type A and Type B (since
confidence $(1 - \delta)$ is lower), Type C co-occurrences represent
an important class of relationships between words. If the document
corpus contains a very small number of documents on some topic, then
strong co-occurrences (i.e., those found with low $\epsilon$) which
are unique to that topic may not be detected at low values of
$\delta$. By relaxing the confidence parameter $\delta$, we may be
able to detect such occurrences (possibly at the cost of some extra
false positives).

Type D: These co-occurrences represent the weak- 
est correlations found in the data, since they neither 
employ a strong document-level hypothesis nor en- 
force a high corpus-level confidence. In most ap- 
plications, we expect Type D co-occurrences to be 
of little use, with their best case utility being to 
provide a baseline for disambiguating Type C co- 
occurrences. 



Type    $\epsilon$    $\delta$
A       < 0.1         < 0.1
B       > 0.4         < 0.1
C       < 0.1         > 0.4
D       > 0.4         > 0.4

Table 1: 4 types of lexical co-occurrences.

In the experiments we describe later, we fix the $\epsilon$ and
$\delta$ for the different Types as per Table 1. Finally, we note that
Types B and C subsume Type A; similarly, Type D subsumes all three
other types. Thus, to detect co-occurrences that are exclusively of
(say) Type B, we would have to run the test with a high $\epsilon$ and
low $\delta$ and then remove from the output those co-occurrences that
are also part of Type A.
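
The subsumption relations translate directly into set differences. The sketch below is a toy illustration: `exclusive_types` is a hypothetical helper, and the `detected` sets are made-up inputs standing in for the output of the significance test at each of the four parameter settings.

```python
def exclusive_types(detected):
    """Map each type to the bigrams that belong to it exclusively.

    `detected[t]` is the set of bigrams passing the significance test when it
    is run with the (epsilon, delta) setting of type t. Types B and C subsume
    A, and D subsumes A, B, and C, so the subsumed sets are subtracted.
    """
    a, b, c, d = (detected[t] for t in "ABCD")
    return {"A": set(a), "B": b - a, "C": c - a, "D": d - (a | b | c)}


# Toy input (hypothetical detections, for illustration only).
detected = {
    "A": {"bread-butter"},
    "B": {"bread-butter", "tiger-carnivore"},
    "C": {"bread-butter", "lobster-wine"},
    "D": {"bread-butter", "tiger-carnivore", "lobster-wine", "stock-egg"},
}
exclusive = exclusive_types(detected)
```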

4 Experimental Results 

4.1 Datasets and Text Corpus 

Since similarity and relatedness are different kinds of word
associations (Budanitsky and Hirst, 2006), in (Agirre et al., 2009)
two different data sets, namely 203 words sim (the union of similar
and unrelated pairs) and 252 words rel (the union of related and
unrelated pairs) datasets are derived from wordsim (Finkelstein et
al., 2002). We use these two data sets in our experiments. These
datasets are symmetric in that the order of words in a pair is not
expected to matter. As some of our chosen co-occurrence measures are
asymmetric, we also report results on the asymmetric 272-words esslli
dataset for the 'free association' task at (ESSLLI, 2008).

We use the Wikipedia (Wikipedia, April 2008)
corpus in our experiments. It contains 2.7 million 
articles for a total size of 1.24 Gigawords. We did 
not pre-process the corpus - no lemmatization, no 
function-word removal. When counting document 
size in words, punctuation symbols were ignored. 
Documents larger than 1500 words were partitioned 
keeping the size of each part to no greater than 1500 
words. 
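
The document partitioning step can be sketched as follows. This is an assumption-laden illustration rather than the pipeline used in the experiments: it tokenizes on whitespace, treats a token as punctuation when it consists only of ASCII punctuation characters, and the name `partition_document` is hypothetical.

```python
import string


def partition_document(text, max_words=1500):
    """Split a document into parts of at most max_words words each.

    Tokens made up entirely of punctuation are ignored when counting size,
    mirroring the description above; no lemmatization or stop-word removal.
    """
    words = [w for w in text.split() if w.strip(string.punctuation)]
    return [words[i:i + max_words] for i in range(0, len(words), max_words)]


parts = partition_document("a , b . " * 1000)  # 2000 words after dropping punctuation
sizes = [len(p) for p in parts]                # [1500, 500]
```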



In Table 2 we present some examples of different
types of co-occurrences observed in the data.

4.2 Performance of different co-occurrence 
measures 

We now compare the performance of various 
frequency-based measures in the context of lexical 
significance. Given the large numbers of measures 



Dataset   Type A bigrams       Type B bigrams      Type C bigrams        Type D bigrams
sim       announcement-news    forest-graveyard    lobster-wine          stock-egg
          bread-butter         tiger-carnivore     lad-brother           cup-object
rel       baby-mother          alcohol-chemistry   victim-emergency      money-withdrawal
          country-citizen      physics-proton      territory-kilometer   minority-peace
esslli    arrest-police        pamphlet-read       meditate-think        fairground-roundabout
          arson-fire           spindly-thin        ramble-walk

Table 2: Examples of Type A, B, C and D co-occurrences under a span constraint of 20 words.



proposed in the literature (Pecina and Schlesinger, 2006), we need to
identify a subset of measures to compare. Inspired by (Janson and
Vegelius, 1981) and (Tan et al., 2006) we identify three properties of
co-occurrence measures which may be useful for language processing
applications. First is Symmetry: does the measure yield the same
association score for (x,y) and (y,x)? Second is Null Addition: does
addition of data containing neither x nor y affect the association
score for (x,y)? And, finally, Homogeneity: if we replicate the corpus
several times and merge the copies to construct a larger corpus, does
the association score for (x,y) remain unchanged? Note that the
concept of homogeneity conflicts with the notion of statistical
support, as support increases in direct proportion with the absolute
amount of evidence. Different applications may need co-occurrence
measures having different combinations of these properties.
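
These three properties can be probed numerically for any count-based measure. The Python sketch below uses the standard forms of Dice, PMI, and SCI with probabilities taken as counts divided by $N$; the helper `check_properties` and the toy counts are assumptions for illustration, and a single numerical probe of this kind is suggestive rather than a proof.

```python
import math


def dice(fxy, fx, fy, n):
    return 2 * fxy / (fx + fy)


def pmi(fxy, fx, fy, n):
    return math.log((fxy / n) / ((fx / n) * (fy / n)))


def sci(fxy, fx, fy, n):
    return (fxy / n) / ((fx / n) * math.sqrt(fy / n))


def check_properties(measure, fxy=30, fx=200, fy=500, n=100_000, k=3):
    """Probe symmetry, null addition, and homogeneity on one set of toy counts."""
    base = measure(fxy, fx, fy, n)
    return {
        # swapping the roles of x and y
        "symmetry": math.isclose(base, measure(fxy, fy, fx, n)),
        # adding 50,000 tokens that contain neither x nor y
        "null_addition": math.isclose(base, measure(fxy, fx, fy, n + 50_000)),
        # replicating the corpus k times
        "homogeneity": math.isclose(base, measure(k * fxy, k * fx, k * fy, k * n)),
    }


report = {m.__name__: check_properties(m) for m in (dice, pmi, sci)}
```

On these toy counts, Dice satisfies all three properties, PMI is symmetric and homogeneous but sensitive to null addition, and SCI is neither symmetric nor invariant under null addition.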

Table 3 shows the characteristics of our chosen co-occurrence
measures, which were selected from several domains like ecology,
psychology, medicine, and language processing. Except Ochiai (Ochiai,
1957; Janson and Vegelius, 1981) and the recently introduced measure
CWCD (Washtell and Markert, 2009),³ all other selected measures are
well-known in the NLP community (Pecina and Schlesinger, 2006). Based
on our extensive study of theoretical and empirical properties of CSR,
we also introduce a new bigram frequency based measure called CSA
(Co-occurrence Significance Approximated), which approximates the
behaviour of CSR over a wide range of parameter settings.



3 From various so-called windowless measures introduced in (Washtell
and Markert, 2009), we chose the best-performing variant Cue-Weighted
Co-Dispersion (CWCD) and implemented a window-based version of it with
harmonic mean. We note that any windowless (or spanless) measure can
easily be thought of as a special case of a window-based measure where
the windowless formulation corresponds to a very large window (or span
in our terminology).



Method                               Formula
CSR (this work)                      $Z / (E(Z) + Kt)$
CSA (this work)
LLR (Dunning, 1993)                  $\sum_{x',y'} p(x',y') \log \frac{p(x',y')}{p(x')\,p(y')}$
PMI (Church and Hanks, 1989)         $\log \frac{p(x,y)}{p(x)\,p(y)}$
SCI (Washtell and Markert, 2009)     $\frac{p(x,y)}{p(x)\sqrt{p(y)}}$
CWCD (Washtell and Markert, 2009)    $\frac{\ldots}{p(x)}$
Pearson's $\chi^2$ test              $\sum_{x',y'} \frac{(f(x',y') - Ef(x',y'))^2}{Ef(x',y')}$
T-test                               $\frac{f(x,y) - Ef(x,y)}{\sqrt{f(x,y)\,(1 - f(x,y)/N)}}$
Dice (Dice, 1945)                    $\frac{2 f(x,y)}{f(x) + f(y)}$
Ochiai (Janson and Vegelius, 1981)   $\frac{f(x,y)}{\sqrt{f(x)\,f(y)}}$
Jaccard (Jaccard, 1912)              $\frac{f(x,y)}{f(x) + f(y) - f(x,y)}$

Terminology: ($x' \in \{x, \neg x\}$ and $y' \in \{y, \neg y\}$)
$N$             Total number of tokens in the corpus
$f(x), f(y)$    Unigram frequencies of $x$, $y$ in the corpus
$p(x), p(y)$    $f(x)/N$, $f(y)/N$
$f(x,y)$        Span-constrained $(x, y)$ bigram frequency
$M$             Harmonic mean of the spans of $f(x,y)$ occurrences
$Ef(x,y)$       Expected value of $f(x,y)$

Table 3: Properties of selected co-occurrence measures







                     Span Threshold
Measure   Data       5w            25w           50w
PMI       sim        -             -             -
          rel        -             -             -
          esslli     -             -             -
CWCD      sim        -             -             -
          rel        -             -             -
          esslli     -             -             -
CSA       sim        A, B, C, D    A, B, C       A, B, C
          rel        A, B, C, D    A, B, C       A, C
          esslli     A, B, C, D    A, B, C       A, C
Dice      sim        A, B, C, D    A, B, C       A, B
          rel        A, B, C, D    -             -
          esslli     -             -             -
Ochiai    sim        A, B, C, D    A, B, C, D    A, B, C
          rel        A, B, C, D    A, B, C       A, B, C
          esslli     A, B, C, D    A, B          A
LLR       sim        A, B, C, D    A, B          A
          rel        A, B, C, D    A             A
          esslli     A, B, C       A             A
TTest     sim        A, B, C       A             -
          rel        A, B, C       -             -
          esslli     -             -             -
SCI       sim        -             -             -
          rel        -             -             -
          esslli     -             -             -

Table 4: Types of lexical co-occurrences detected by different measures

In our experiments, we found that Ochiai and Chi-Square have almost
identical performance, differing only in 3rd decimal digits. This can
be explained easily. In our context, for any word $x$, as defined in
Table 3, $f(x) \ll N$ and therefore $p(x) \ll 1$. With this,
Chi-Square reduces to the square of Ochiai. Similarly, Jaccard and
Dice coincide, since $f(x,y) \ll f(x)$ and $f(x,y) \ll f(y)$. Hence we
do not report further results for Chi-Square and Jaccard.
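
The two reductions can be written out explicitly. The derivation below is a sketch under stated assumptions: it takes the standard $2 \times 2$ contingency-table form of Pearson's chi-square with cells $f(x,y)$, $f(x) - f(x,y)$, $f(y) - f(x,y)$ and the remainder of $N$, and it additionally assumes $f(x,y)\,N \gg f(x)\,f(y)$ (observed co-occurrence well above chance), which is not stated in the text. The resulting factor $N$ is the same for every bigram, so rankings by Chi-Square and by Ochiai agree. For Jaccard and Dice the relation is an exact monotone identity, so their rankings coincide without any approximation.

```latex
\begin{align*}
\chi^2 &= \frac{N\,\bigl(f(x,y)\,N - f(x)\,f(y)\bigr)^2}
               {f(x)\,f(y)\,\bigl(N - f(x)\bigr)\bigl(N - f(y)\bigr)}
        \;\approx\; \frac{N\,f(x,y)^2\,N^2}{f(x)\,f(y)\,N^2}
        \;=\; N \left(\frac{f(x,y)}{\sqrt{f(x)\,f(y)}}\right)^{2}
        \;=\; N \cdot \mathrm{Ochiai}^2 \\[6pt]
\mathrm{Jaccard} &= \frac{f(x,y)}{f(x) + f(y) - f(x,y)}
        \;=\; \frac{\mathrm{Dice}}{2 - \mathrm{Dice}},
        \qquad \mathrm{Dice} = \frac{2\,f(x,y)}{f(x) + f(y)}
\end{align*}
```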

In our first set of experiments, we compared the performance of
various frequency-based measures in terms of their suitability for
detecting lexically significant co-occurrences (cf. Definition 1). A
high Spearman correlation coefficient between the ranked list produced
by a given measure and the list produced by CSR with respect to some
choice of $\epsilon$ and $\delta$ would imply that the measure is
effective in detecting the corresponding type of lexically significant
co-occurrences.
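
The comparison amounts to a rank correlation between two score lists over the same bigrams. The Python sketch below computes Spearman's coefficient as the Pearson correlation of average ranks, so ties are handled; the function names and the toy scores are assumptions for illustration, and a library routine such as `scipy.stats.spearmanr` could be used instead.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy scores for five bigrams (hypothetical numbers).
csr_scores = [1.49, 0.84, 2.10, 0.30, 1.05]
measure_scores = [0.40, 0.22, 0.55, 0.05, 0.35]
rho = spearman(measure_scores, csr_scores)
effective = rho > 0.90   # the 0.90 cut-off used for declaring a measure effective
```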

Table 4 lists, for each measure and for each data set, the different
types of lexically significant co-occurrences that the measure is able
to detect effectively: if the corresponding Spearman correlation
coefficient exceeds 0.90, we consider the measure to be effective for
the given type. Results are shown for three different span
constraints: a small span of 5 words (or 5w), a medium span of 25
words (or 25w), and a large span of 50 words (or 50w). For example,
the CSA and Ochiai measures are effective in detecting all 4 types of
lexically significant co-occurrences (A, B, C and D) in all three data
sets when the span constraint is set to 5 words. Figure 1 presents a
detailed quantitative comparison of the best performance of each
measure with respect to each type of co-occurrence for a range of
different span constraints on the sim data set (similar results were
obtained on other data sets). The inferences we can draw are
consistent with the results of Table 4.

Table 5: Best performing ($\epsilon$, $\delta$)-pairs for different measures on sim data

Figure 1: Maximum correlation of various measures with various types of CSR for sim dataset

In our next experiment, we examine which of the four types of
co-occurrences are best captured by each measure. Results for the sim
data set are listed in Table 5 (similar results were obtained on the
other data sets). For each measure and for each span constraint, the
table describes the best performing parameters ($\epsilon$ and
$\delta$), the corresponding co-occurrence Type, and the associated
'best' correlation achieved with respect to the test of Definition 1.
The results show that, irrespective of the span constraint, most
measures perform best on Type A co-occurrences. This is reasonable
because Type A essentially represents the strongest correlations in
the data and one would expect the measures to capture the strong
correlations better than weaker ones. There are, however, two
exceptions to this rule, namely PMI and CWCD, which instead peak at
Types C or D. The best correlations for these two measures are also
typically lower than those of the other measures. We now summarize the
main findings from our study:

• The relatively obscure Ochiai and the newly
introduced CSA are the best performing measures,
in terms of detecting all types of lexical
co-occurrences in all data sets and for a wide
range of span constraints.

• Dice, LLR and TTest are the other measures 
that effectively track lexically significant co- 
occurrences (although, all three are less effec- 
tive as the span constraints become larger). 

• SCI, CWCD, and the popular PMI measure are
ineffective at capturing any notion of lexically
significant co-occurrences, even for small span
constraints. In fact, the best result for PMI is
the detection of Type C co-occurrences in the
sim data set. The low $\epsilon$ and high $\delta$ setting of
Type C suggests that PMI does a poor job of detecting
the strongest co-occurrences in the data,
overlooking both strong document-level and
strong corpus-level cues for lexical significance.

Note that our results do not contradict the utility of PMI, SCI, or
CWCD as word-association measures. We only observe their poor
performance in the context of detecting lexical co-occurrences. Also,
our notion of lexical co-occurrence is symmetric. It is possible that
asymmetric SCI may have competitive performance for certain asymmetric
tasks compared to the better performing symmetric measures. Finally,
to give a qualitative feel for the differences in the correlations
preferred by different methods, in Table 6 we show the top 10 bigrams
picked by PMI and Ochiai for all three datasets.

5 Relation between lexical co-occurrence 
and human judgements 

While the focus of our work is on characterizing the statistically
significant lexical co-occurrence, as illustrated in Table 7, human
judgement of word association is governed by many factors in addition
to lexical co-occurrence considerations, and many non co-occurrence
based measures have been designed to capture semantic word
association. Notable among them are distributional similarity based
measures (Agirre et al., 2009; Bollegala et al., 2007; Chen et al.,
2006) and knowledge-based measures (Milne and Witten, 2008; Hughes and
Ramage, 2007; Gabrilovich and Markovitch, 2007; Yeh et al., 2009;
Strube and Ponzetto, 2006; Finkelstein et al., 2002; Wandmacher et
al., 2008). Since our focus is on frequency based measures alone, we
do not discuss these other measures.

The lexical co-occurrence phenomenon and the human judgement of
semantic association are related but different dimensions of
relationships between words, and different applications may prefer one
over the other. For example, suppose that, given one word (say dirt),
the task is to choose from among a number of alternatives for the
second (say grime and filth). Human judgment scores for (dirt, grime)
and (dirt, filth) are 5.4 and 6.1 respectively. However, their lexical
co-occurrence scores (CSR) are 1.49 and 0.84 respectively. This is
because filth is often used in a moral context as well. Grime is
usually used only in a physical sense. Dirt is used mostly in a
physical sense, but is a bit more generic and may be used in a moral
sense occasionally. Hence (dirt, grime) is more correlated in the
corpus than (dirt, filth). This shows that human judgement is fallible
and annotators may ignore the subtleties of meaning that may be picked
up by statistical techniques like ours.

Table 6: Top 10 bigrams according to PMI and Ochiai rankings on sim, rel, and esslli datasets. 'R' denotes the bigram
rankings according to the type-A CSR measure ($\epsilon = 0.1$, $\delta = 0.1$). A span of 25 words is used for all three measures.

Table 7: Top 10 word associations picked in rel dataset. The numbers in the brackets are the cross rankings: CSR
rankings in the human row and human rankings in the CSR row. CSR parameters are the same as those for Table 6.

In general, for association with a given word, all
synonyms of a second word will be given similar
semantic relatedness scores by human judges, but they
may have very different lexical association scores.

For applications where the notion of statistical 
lexical co-occurrence is potentially more relevant 
than semantic relatedness, our method can be used 
to generate a gold-standard of lexical association 
(against which other association measures can be 
evaluated). In this context, it is interesting to note
that, contrary to the human judgement, every one of
the co-occurrence measures we studied finds (dirt,
grime) more associated than (dirt, filth).
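
The disagreement in the dirt example can be made concrete with a small
sketch. The scores are the ones quoted in the text for the cue word dirt;
the helper name `best_alternative` is hypothetical and only picks the
highest-scoring candidate.

```python
# Scores quoted in the text for the cue word "dirt".
human_scores = {"grime": 5.4, "filth": 6.1}   # human judgement
csr_scores = {"grime": 1.49, "filth": 0.84}   # lexical co-occurrence (CSR)


def best_alternative(scores):
    """Return the candidate word with the highest score."""
    return max(scores, key=scores.get)


human_choice = best_alternative(human_scores)
csr_choice = best_alternative(csr_scores)

print("human judgement prefers:", human_choice)
print("CSR prefers:", csr_choice)
```

An application that selects the second word by human relatedness would pick
filth, while one that selects by lexical co-occurrence would pick grime.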

Having explained that significant lexical co-occurrence
is a fundamentally different notion from human
judgement of word association, we also want to
emphasize that the two are not completely different
notions either, and they correlate reasonably well
with each other. For the sim, rel, and esslli datasets,
CSR's best correlations with human judgement are
0.74, 0.65, and 0.46 respectively. Note that CSR is a
symmetric notion and hence correlates far more with
human judgement for the symmetric sim and rel datasets
than for the asymmetric esslli dataset. Also, at first
glance, it is a little counter-intuitive that the
notion of lexical co-occurrence yields better
correlations with the sim (based on similarity) dataset
than with the rel (based on relatedness) dataset. This
can essentially be explained by our observation that
similar words tend to co-occur less frequently by
chance than related words.
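
For readers who wish to reproduce this kind of comparison, the sketch below
shows one common way to correlate a measure's scores with human scores over
the same word pairs. It assumes Spearman's rank correlation, which is an
assumption of this sketch since this section does not name the coefficient.
All scores are hypothetical and are not drawn from the sim, rel, or esslli
datasets.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical scores for five word pairs (same pair order in both lists).
human = [7.5, 6.0, 5.0, 3.0, 1.2]
measure = [2.1, 0.8, 1.5, 0.3, 0.05]

rho = spearman(human, measure)
print(f"Spearman correlation: {rho:.2f}")
```

A rank-based coefficient is a natural choice here because human scores and
co-occurrence scores live on different scales, so only the orderings are
directly comparable.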

6 Conclusions 

In this paper, we introduced the notion of statistically
significant lexical co-occurrences. We detected
skews in span distributions of bigrams to assess
significance and showed how our method allows
classification of co-occurrences into different types. We
performed experiments to assess the performance of
various frequency-based measures for detecting
lexically significant co-occurrences. We believe lexical
co-occurrence can play a critical role in several
applications, including sense disambiguation,
multi-word spotting, etc. We will address some of these in
our future work.



References 

[Agirre et al.2009] Agirre, Eneko, Enrique Alfonseca, 
Keith Hall, Jana Kravalova, Marius Pasca, and Aitor 
Soroa. 2009. A study on similarity and relatedness 
using distributional and wordnet-based approaches. In 
NAACL-HLT. 

[Bollegala et al.2007] Bollegala, Danushka, Yutaka Mat- 
suo, and Mitsuru Ishizuka. 2007. Measuring semantic 
similarity between words using web search engines. In 
WWW, pages 757-766. 

[Budanitsky and Hirst2006] Budanitsky, Alexander and
Graeme Hirst. 2006. Evaluating wordnet-based measures
of lexical semantic relatedness. Computational
Linguistics, 32(1):13-47.

[Chen et al.2006] Chen, Hsin-Hsi, Ming-Shun Lin, and 
Yu-Chuan Wei. 2006. Novel association measures us- 
ing web search with double checking. In ACL. 

[Church and Hanks1989] Church, Kenneth Ward and
Patrick Hanks. 1989. Word association norms, mutual 
information and lexicography. In ACL, pages 76-83. 

[Dice1945] Dice, L. R. 1945. Measures of the amount
of ecological association between species. Ecology, 
26:297-302. 

[Dunning1993] Dunning, Ted. 1993. Accurate methods
for the statistics of surprise and coincidence.
Computational Linguistics, 19(1):61-74.

[ESSLLI2008] ESSLLI. 2008. Free association 
task at lexical semantics workshop esslli 2008. 
http://wordspace.collocations.de/doku.php/workshop:esslli:task

[Finkelstein et al.2002] Finkelstein, Lev, Evgeniy 
Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, 
Gadi Wolfman, and Eytan Ruppin. 2002. Placing 
search in context: the concept revisited. ACM Trans.
Inf. Syst., 20(1):116-131.

[Gabrilovich and Markovitch2007] Gabrilovich, Evgeniy 
and Shaul Markovitch. 2007. Computing seman- 
tic relatedness using wikipedia-based explicit seman- 
tic analysis. In IJCAI. 

[Hughes and Ramage2007] Hughes, T. and D. Ramage.
2007. Lexical semantic relatedness with random graph
walks. In EMNLP.

[Jaccard1912] Jaccard, P. 1912. The distribution of the
flora of the alpine zone. New Phytologist, 11:37-50.

[Janson and Vegelius1981] Janson, Svante and Jan
Vegelius. 1981. Measures of ecological association.
Oecologia, 49:371-376.

[Kilgarriff2005] Kilgarriff, Adam. 2005. Language is 
never ever ever random. Corpus Linguistics and
Linguistic Theory, 1(2):263-276.

[Milne and Witten2008] Milne, David and Ian H. Witten.
2008. An effective, low-cost measure of semantic
relatedness obtained from wikipedia links. In ACL.



[Ochiai1957] Ochiai, A. 1957. Zoogeographical studies on
the soleoid fishes found in japan and its neighbouring 
regions-ii. Bulletin of the Japanese Society of Scien- 
tific Fisheries, 22. 

[Pecina and Schlesinger2006] Pecina, Pavel and Pavel 
Schlesinger. 2006. Combining association measures 
for collocation extraction. In ACL. 

[Strube and Ponzetto2006] Strube, Michael and Si- 
mone Paolo Ponzetto. 2006. Wikirelate! computing 
semantic relatedness using wikipedia. In AAAI, pages 
1419-1424. 

[Tan et al.2006] Tan, Pang-Ning, Michael Steinbach, and 
Vipin Kumar. 2006. Chapter 6.7: Evaluation of asso- 
ciation patterns. In Introduction to Data Mining, pages 
379-382. Pearson Education, Inc. 

[Wandmacher et al.2008] Wandmacher, T, E. Ovchin- 
nikova, and T. Alexandrov. 2008. Does latent seman- 
tic analysis reflect human associations? In European 
Summer School in Logic, Language and Information 
(ESSLLI'08). 

[Washtell and Markert2009] Washtell, Justin and Katja 
Markert. 2009. A comparison of windowless and 
window-based computational association measures as 
predictors of syntagmatic human associations. In 
EMNLP, pages 628-637. 

[WikipediaApril 2008] Wikipedia. April 2008.
http://www.wikipedia.org

[Yeh et al.2009] Yeh, Eric, Daniel Ramage, Chris
Manning, Eneko Agirre, and Aitor Soroa. 2009.
Wikiwalk: Random walks on wikipedia for semantic
relatedness. In ACL workshop "TextGraphs-4:
Graph-based Methods for Natural Language Processing".



